FIRST: Many thanks to Britta Schumacher for originally compiling these materials, and to Dr. Simona Picardi’s ggplot chapter, & Dr. Alison Horst’s tidyverse aRt.
Artwork by the incredible Alison Horst
ggplot2 is the data visualization package within the
tidyverse; it takes clean, organized, manipulated data, and
(with your direction) builds beautiful, & more importantly,
communicative, plots. ggplot2 is based on the
grammar of graphics, which asserts that we can build every
graph from the same few components: a tidy data set, a set of geometries (geoms), and a coordinate system.
To actually display data values, we map out our variables to
aesthetic things like size, color, and
x and y locations; we also tell
ggplot2 the type of visualization we are interested in
building (e.g., bar graph, box plot, line graph, density plots, etc.).
ggplot2 opens up data science to
broader audiences and helps all of us communicate our science. Download
this cheat sheet and save it somewhere accessible; it’s an incredible tool to
refer back to!
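To make the grammar concrete, here is a minimal sketch of the pattern (data, then aesthetic mappings, then a geometry). The `toy` data frame and its column names are invented for illustration and are not part of the workshop data.

```r
library(ggplot2)

# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  wingspan   = c(4.1, 5.3, 6.8, 7.2, 9.0),
  fire_range = c(10, 14, 19, 22, 30),
  colour     = c("red", "red", "green", "green", "green")
)

# 1) data, 2) aesthetic mappings (x, y, color), 3) a geometry (points)
p_toy <- ggplot(data = toy,
                mapping = aes(x = wingspan, y = fire_range, color = colour)) +
  geom_point(size = 3)

p_toy  # printing the object draws the plot
```

Swapping `geom_point()` for another geom changes the type of plot while the data and mappings stay the same.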
ggplot2 works with data.frames
(tibbles), the data type we built in our previous
tidyverse workshop. The data we feed into
ggplot2 consists of rows and columns that “live peacefully”
together in a tidy data.frame. This means, often, we need
to do some data cleaning up front to get our data into the tidy
format we need for visualization.
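As a small sketch of what "getting data into the tidy format" can look like, the example below reshapes a wide table (one column per year) into one row per observation. The `wide` table and its values are made up for illustration.

```r
library(tidyr)

# Hypothetical "wide" table: one scars column per year (values invented)
wide <- data.frame(
  dragon     = c("Smaug", "Toothless"),
  scars_2020 = c(3, 0),
  scars_2021 = c(5, 1)
)

# Tidy version: one row per dragon per year, one column per variable
long <- pivot_longer(
  wide,
  cols         = starts_with("scars_"),
  names_to     = "year",
  names_prefix = "scars_",   # strip the prefix so only the year remains
  values_to    = "scars"
)

long
```

In the long version, `year` can be mapped directly to an aesthetic such as `x` or `color`, which is not possible while the years are spread across column names.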
What follows is a brief review of some core tidyverse
functions and their use for tidying and manipulating our data so that it
plays nice with ggplot. We will use a real-ish dataset
containing information on dragons (courtesy of the DALEX
package). This is the same data we used in the previous tidyverse
workshop.
In honor of Black Dragon Canyon Wash, let’s pretend we’re trying to understand how different species of chromatic dragons (a decently temperamental critter) with various life lengths compare in terms of number of scars and BMI.
#install.packages("DALEX")
library(tidyverse) # ggplot2 and more; or library(ggplot2)
library(DALEX) # for data
library(RColorBrewer) # for nice color palettes
data(dragons)
view(dragons)
We should first familiarize ourselves with the data.
dim(dragons) # view dimensions of the df
head(dragons) # view first 6 rows of df (the default)
tail(dragons) # view last 6 rows of df (the default)
str(dragons) # view data structure of df
colnames(dragons) # view the columns of df
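Let’s quickly brush up on wrangling and clean up this data set for our purposes. We will filter for black and blue dragons, select the relevant columns, rename the long column names, create genus, species, BMI, and age group columns, and unite genus and species into a single column.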
Let’s create an efficient workflow by combining ALL
of these data wrangling steps; i.e., let’s review harnessing the great
POWERS of the tidyverse and pipes!
# recall how we clean and manipulate data
dragons_simple <- dragons %>%
filter(colour %in% c("black","blue")) %>% # filter for blue & black dragons
# select only relevant columns (height, weight, scars, colour, number_of_lost_teeth, and life_length)
select(2:5,7, life_length) %>%
# rename columns with terribly long names
rename(teeth = number_of_lost_teeth,
age = life_length) %>%
# create new columns, "genus", based on colour ... and "species", based on colour and age
mutate(genus = if_else(colour %in% c("blue"), "Sauroniops", "Jaggermeryx"),
species = case_when(
colour == "blue" & age < 1200 ~ "reike",
colour == "blue" & age >= 1200 ~ "naida", # >= so a dragon aged exactly 1200 is not left as NA
colour == "black" & age < 1000 ~ "ozzyi",
colour == "black" & (age >= 1000 & age <= 1700) ~ "whido",
colour == "black" & age > 1700 ~ "strummeri")) %>%
# create new column, BMI (kg / height^2); weight in tons -> pounds (* 2000) -> kg (* 0.45359237)
mutate(BMI = (weight * 2000 * 0.45359237) / height^2) %>%
# combine genus and species into one column
unite(genus_species, genus, species, sep = " ") %>%
# create categorical variables to group dragons by age
mutate(age_group = as.factor(case_when(
age < 1000 ~ "dragonling",
age >= 1000 & age <= 2000 ~ "juvenile",
age > 2000 ~ "adult")))
With this simplified and cleaned data set, we’re ready to start visualizing!
As we said before, ggplot needs 3 things: 1) tidy
data, 2) a set of geometries
(geoms), 3) a coordinate system
(aes mappings).
First box checked! More details on the second and third:
Geometries are functions that create the
shapes of your data (think points, lines, boxplots,
histograms). They control the essential backbones of the graphical
elements. These functions start with geom_.
The coordinate system (or
mappings) can be thought of as how those shapes are
organized. It tells your shapes what data to use and how. These
mappings are assigned with calls to aes(), which can be
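done in two major ways: globally inside ggplot(), where every geom inherits them, or locally inside an individual geom, where they apply to that layer only.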
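A short sketch of the two common places to put `aes()`: inside `ggplot()` or inside a single geom. The `toy` data frame is hypothetical and exists only to show the difference.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  x   = c(1, 2, 3, 4),
  y   = c(2, 4, 5, 8),
  grp = c("a", "a", "b", "b")
)

# Way 1: global mappings in ggplot(); every geom inherits them,
# so both the points AND the lines are colored by group
p_global <- ggplot(toy, aes(x = x, y = y, color = grp)) +
  geom_point() +
  geom_line()

# Way 2: local mapping inside one geom; only the points are colored,
# and geom_line() draws a single uncolored line through all points
p_local <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = grp)) +
  geom_line()
```

Global mappings save typing when every layer should share them; local mappings are handy when only one layer should respond to a variable.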
There are other functions, which I will call
“customizers,” which allow us to tweak the display of
our data in myriad ways. The most prominent of these are
themes, labels, scales, and guides.
Multiple graphical elements (including geoms and
customizers) can be added to the ggplot in any order, separated by
+.
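As a rough sketch of how customizers stack onto a plot with `+`, the example below adds labels, a manual color scale, and theme tweaks to a toy scatterplot. The data frame, titles, and labels are all made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  x   = 1:4,
  y   = c(2, 4, 5, 8),
  grp = c("a", "a", "b", "b")
)

p_custom <- ggplot(toy, aes(x = x, y = y, color = grp)) +
  geom_point(size = 3) +                                       # geometry
  labs(title = "A toy plot", x = "X value", y = "Y value",
       color = "Group") +                                      # labels
  scale_color_manual(values = c(a = "#1b9e77", b = "#d95f02")) + # scale
  theme_bw() +                                                 # complete theme
  theme(legend.position = "bottom")                            # theme tweak
```

One caveat to "any order": a complete theme such as `theme_bw()` replaces earlier `theme()` settings, so specific `theme()` tweaks should come after it.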
Starting simple, we can make a scatterplot of the height and weight of our dragons.
Let’s plot only the rare species though (i.e., NOT Sauroniops naida or Sauroniops reike), so we don’t have so many points.
scatter_data <- dragons_simple %>%
filter(genus_species!="Sauroniops reike" & genus_species!="Sauroniops naida")
Now let’s make just the basic scatterplot of all our rare dragons.
# base scatterplot
ggplot(scatter_data, aes(x = weight, y = height)) +
geom_point()
We can also color-code by species by using the
genus_species column as an additional aes()
mapping.
# color-coded points
ggplot(scatter_data, aes(x = weight, y = height, color = genus_species)) +
geom_point()
Cool! Now we have all the data we want to show, but the plot is still pretty rough around the edges. Let’s fix the legend, labels, and axis scales (which are common tasks for any publication figure) … and make the points a little easier to see too.
# make it pretty
ggplot(scatter_data, aes(x = weight, y = height, color = genus_species)) +
geom_point(size = 3) + # points bigger
labs(title = "Chromatic dragon size by species", color = "Species",
x = "Weight (tons)", y = "Height (meters)") + # add labels
#adjust scales and colors (make colorblind-friendly)
scale_x_continuous(limits = c(10,17),
breaks = seq(10,17,2)) +
scale_y_continuous(breaks = seq(30,70,5)) +
scale_color_manual(values = c('#1b9e77', '#d95f02', '#7570b3')) +
theme_bw() + # change the theme and move the legend
theme(legend.position = c(0.8,0.25))
# save this for later
scatter <- last_plot()
Now let’s look at a different part of the data, and make a set of boxplots that show the range of teeth lost by each species.
First, just the basic boxplot for all species grouped together.
# make one boxplot for all dragons
ggplot(dragons_simple, aes(y = teeth)) +
geom_boxplot()
# NOTE: the numbers on the unspecified (x) axis are meaningless, ignore for now
Similar to using color for points in the scatterplot, ggplot can also
break up our data for us in boxplots. Let’s make it a series of boxplots
by using the genus_species column as an additional
aes() mapping.
# make a boxplot for each species
ggplot(dragons_simple, aes(x = genus_species, y = teeth)) +
geom_boxplot()
… and we can make this prettier with some color and “customizers!”
# make it pretty
ggplot(dragons_simple, aes(x=genus_species, y=teeth, fill=genus_species)) +
geom_boxplot() + # make a boxplot
labs(x = "", y = "Teeth Lost", fill = "Species") + # fix labels
scale_fill_brewer(palette = 'Dark2') + # same colors as before
theme_minimal() + # change the preset theme
theme(legend.position = "bottom") + # move legend to the bottom
guides(fill = guide_legend(nrow=2)) # give legend two rows
# save this for later
boxes <- last_plot()
For this exercise, we will randomly assign our dragons to different diets and plot some trait summaries by species and diet. Let’s first isolate data we want to visualize by:
pivot_longer() into long format for visualizing
set.seed(121) # to ensure everyone gets the same random numbers
# randomly assign diet groups ONCE, so the means and SEs describe the same dragons
dragons_diet <- dragons_simple %>%
  mutate(random = runif(n(), 0, 1),
         diet = if_else(random >= 0.5, 'control', 'gourmet'))
means <- dragons_diet %>%
  # group by diet and species
  group_by(diet, genus_species) %>%
  # summarize mean scars, BMI, and teeth lost
  summarize(scars = mean(scars),
            BMI = mean(BMI),
            teeth = mean(teeth),
            .groups = 'drop') %>%
  # pivot_longer() to format for column graph
  pivot_longer(cols = c(scars, BMI, teeth),
               names_to = "summary_var", values_to = "mean_value")
ses <- dragons_diet %>%
  # group by diet and species
  group_by(diet, genus_species) %>%
  # summarize standard error (sd / sqrt(n)) of scars, BMI, and teeth lost
  summarize(scars = sd(scars, na.rm = TRUE) / sqrt(n()),
            BMI = sd(BMI, na.rm = TRUE) / sqrt(n()),
            teeth = sd(teeth, na.rm = TRUE) / sqrt(n()),
            .groups = 'drop') %>%
  # pivot_longer() to format for column graph
  pivot_longer(cols = c(scars, BMI, teeth),
               names_to = "summary_var", values_to = "se_value")
# join means and standard errors into one table for plotting
bar_data <- left_join(means, ses,
                      by = c('diet', 'genus_species', 'summary_var'))
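Now that we have our data summarized and in the right format, we’re ready to make a plot! We want to show the mean of each trait by species, with side-by-side columns for each diet, a separate panel for each trait, and error bars for the standard errors.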
Our goal is something like this:
To create the bones of the plot, we will need geom_col()
like before, but we will have to combine it with
facets, which essentially create multiple plots with
the same geometries, but using different subsets of data.
ggplot(bar_data, aes(x = genus_species, y = mean_value, fill = diet)) +
facet_wrap(~summary_var) + # create separate panels for each trait
geom_col() + # separate columns for each diet group
geom_errorbar(aes(xmin=mean_value-se_value,
xmax=mean_value+se_value))
There are several things wrong with this plot… What can we do about them? We might have to dig into some of the arguments for the geometries…
?geom_col
?geom_errorbar
Ah ha! So we had a typo (probably from copying code from Stack
Exchange) and it looks like the default position argument
is a little wacky for our purposes. What should it be?
ggplot(bar_data, aes(x = genus_species, y = mean_value, fill = diet)) +
facet_wrap(~summary_var) + # create separate panels for each trait
# separate columns for each diet group - now dodged instead of stacked
geom_col(position = 'dodge') +
geom_errorbar(aes(ymin=mean_value-se_value, #change that "x" to a "y"
ymax=mean_value+se_value),
position = position_dodge(0.9), width=0.3)
# NOTE: trying position = "dodge" and width = 1 might tell you something about what is going on under the hood
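To see why the dodge width matters, here is a small sketch on hypothetical toy data (all values invented). With `position = "dodge"`, each layer is dodged using its own element width, so narrow error bars (width 0.3) end up closer together than the columns (default width 0.9). Passing `position_dodge(width = 0.9)` makes the error bars use the same dodge width as the columns.

```r
library(ggplot2)

# Hypothetical summary table, invented for illustration
toy <- data.frame(
  species    = rep(c("A", "B"), each = 2),
  diet       = rep(c("control", "gourmet"), times = 2),
  mean_value = c(3, 4, 5, 6),
  se_value   = c(0.3, 0.4, 0.2, 0.5)
)

base <- ggplot(toy, aes(x = species, y = mean_value, fill = diet)) +
  geom_col(position = "dodge")   # columns have a default width of 0.9

# Error bars dodged by their OWN width (0.3): not centered on the columns
p_misaligned <- base +
  geom_errorbar(aes(ymin = mean_value - se_value,
                    ymax = mean_value + se_value),
                position = "dodge", width = 0.3)

# Error bars dodged by the column width (0.9): centered on the columns
p_aligned <- base +
  geom_errorbar(aes(ymin = mean_value - se_value,
                    ymax = mean_value + se_value),
                position = position_dodge(width = 0.9), width = 0.3)
```

Printing `p_misaligned` and `p_aligned` side by side shows the difference.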
Now that we have the actual graph looking correct, we can customize it:
ggplot(bar_data, aes(x = genus_species, y = mean_value, fill = diet)) +
facet_wrap(~summary_var) + # create separate panels for each trait
geom_col(position = 'dodge', color='white') + # separate columns for each diet
geom_errorbar(aes(ymin=mean_value-se_value,
ymax=mean_value+se_value),
position = position_dodge(0.9), width=0.3) +
labs(title = "Chromatic dragons on different diets", x = "", y = "",
fill = "diet") + # add labels
scale_y_continuous(expand = c(0,0,0,1)) + # adjust space on top and bottom
scale_fill_manual(values = c('grey35','grey70')) + # change colors
theme_bw() + # change the theme
theme(legend.position = 'bottom',
axis.text.x = element_text(angle = 45, hjust = 0.9),
strip.text = element_text(size = 14))
Looks pretty close to our goal, doesn’t it?
# save this for later
bar <- last_plot()
Patchwork
The following is adapted from materials designed by Simona Picardi.
Artwork by the incredible Alison Horst
The package patchwork is a ggplot extension that allows
you to combine and arrange multiple plots into a single plot with
several panels. Let’s load in the package (after installing it):
#install.packages("patchwork")
library(patchwork)
The first step to combine different plots is to store each plot by
assigning it its own name. Each will be stored in the environment as a
ggplot2 object. For example, let’s take our plots from
earlier and call them p1, p2, and
p3:
p1 <- scatter
p2 <- boxes
p3 <- bar
The syntax for arranging plots uses combinations of three symbols:
+, /, and |. Using +
is the simplest option and it can be combined with the
plot_layout function to tweak how many rows and columns the
plots are arranged on (among other things). If no layout is specified,
plots are aligned in row order:
p1 + p2 + p3
Specify 2 rows:
p1 + p2 + p3 + plot_layout(nrow = 2)
We can enforce hierarchy/groups with parentheses:
p1 + (p2 + p3) + plot_layout(nrow = 2)
The symbol | means “side by side” and the symbol
/ means “one on top of each other”. We can combine these to
obtain any layout we want:
p1 | p2 / p3
(p1 | p2) / p3
(p1 / p2) | p3
Here is an example of more patchwork functionality:
we added panel tags and used ggplot2 guides to
create a universal legend:
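A minimal sketch of one way to get panel tags and a shared legend, using two hypothetical toy plots (`pa` and `pb`) rather than the workshop plots. `plot_annotation(tag_levels = "A")` adds the A, B, ... tags, and `plot_layout(guides = "collect")` gathers the legends, merging those that are identical.

```r
library(ggplot2)
library(patchwork)

# Hypothetical toy data and plots, invented for illustration
toy <- data.frame(
  x   = 1:4,
  y   = c(2, 4, 5, 8),
  grp = c("a", "a", "b", "b")
)
pa <- ggplot(toy, aes(x = x, y = y, color = grp)) + geom_point()
pb <- ggplot(toy, aes(x = y, y = x, color = grp)) + geom_point()

combined <- (pa | pb) +
  plot_layout(guides = "collect") +    # gather legends into one place
  plot_annotation(tag_levels = "A") &  # tag panels A, B, ...
  theme(legend.position = "bottom")    # `&` applies the theme to every panel
```

Note the `&` operator: unlike `+`, which only modifies the last plot, `&` applies the added element to all plots in the patchwork.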
You probably want to occasionally save your plots for later viewing
or publication. It may be tempting to save a plot by clicking
Export on the RStudio plot panel. This way is crude and
frustrating. Instead, use ggsave! ggsave is
the ggplot2 function that allows you to save your plots and
specify exactly how they will appear in the end. By default, if you
don’t assign a name to a plot and specify its name in
ggsave, it will assume you want to save the last plot you
ran.
This is the breakdown:
# create the output folder first; the plot cannot be written into a folder that does not exist
dir.create("./out-plots", showWarnings = FALSE)
ggsave(filename = "./out-plots/dragon_patchwork.tiff",
device = "tiff", # tiff or pdf for saving publication-quality figures
width = 14, # define the width of the plot in your units of choice
height = 8, # define the height of the plot in your units of choice
units = "in", # define your units of choice ("in" stands for inches)
dpi = 400) # dots per inch; handy when journal guidelines have a specific requirement
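Rather than relying on the "last plot" default, a named plot can be passed explicitly through the `plot` argument. The sketch below uses a hypothetical toy plot and writes to a temporary folder, so the file name and location are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical toy plot, invented for illustration
toy_plot <- ggplot(data.frame(x = 1:3, y = c(2, 5, 3)), aes(x = x, y = y)) +
  geom_line()

# Write to a temporary folder so this sketch runs anywhere
out_file <- file.path(tempdir(), "toy_plot.pdf")

ggsave(filename = out_file,
       plot = toy_plot,   # save this specific plot, not whatever was drawn last
       width = 6, height = 4, units = "in")

file.exists(out_file)  # check that the file was written
```

Naming the plot explicitly makes a script safer to rerun, since drawing another plot in between no longer changes what gets saved.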
There are just about a bazillion different ways to visualize the same data:
Thank you, Stack Overflow, for this wild viz of all the ways we could visualize data about diamonds!
… In the end we just have to choose the method that makes the most sense for our audience, our research questions and our data!
Additionally, aRtists everywhere are using R to make ridiculously cool pieces of generative and patterned art. Check out some talented RLadies here and here! Plus, #TidyTuesday submissions can be absolutely, wildly, brilliantly, good.
# whatever code you want goes here!